Strategies for Reducing and Correcting OCR Errors

Authors

  • Martin Volk
  • Lenz Furrer
  • Rico Sennrich
Abstract

In this paper we describe our efforts in reducing and correcting OCR errors in the context of building a large multilingual heritage corpus of Alpine texts which is based on digitizing the publications of various Alpine clubs. We have already digitized the yearbooks of the Swiss Alpine Club from its start in 1864 until 1995 with more than 75,000 pages resulting in 29 million running words. Since these books have come out continuously, they represent a unique basis for historical, cultural and linguistic research. We used commercial OCR systems for the conversion from the scanned images to searchable text. This poses several challenges. For example, the built-in lexicons of the OCR systems do not cover the 19th century German spelling, the Swiss German spelling variants and the plethora of toponyms that are characteristic of our text genre. We also realized that different OCR systems make different recognition errors. We therefore run two OCR systems over all our scanned pages and merge the output. Merging is especially tricky at spots where both systems result in partially correct word groups. We describe our strategies for reducing OCR errors by enlarging the systems' lexicons and by two post-correction methods, namely merging the output of two OCR systems and auto-correction based on additional lexical resources.

DOI: https://doi.org/10.1007/978-3-642-20227-8_1
ZORA URL: https://doi.org/10.5167/uzh-54277 (Accepted Version, posted at the Zurich Open Repository and Archive, University of Zurich)
Originally published at: Volk, Martin; Furrer, Lenz; Sennrich, Rico (2011). Strategies for reducing and correcting OCR errors. In: Sporleder, Caroline; van den Bosch, Antal; Zervanou, Kalliopi. Language Technology for Cultural Heritage. Berlin: Springer, 3-22.
Martin Volk, Institute of Computational Linguistics, University of Zurich
Lenz Furrer, Institute of Computational Linguistics, University of Zurich
Rico Sennrich, Institute of Computational Linguistics, University of Zurich
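The two-system merging described in the abstract can be sketched as follows. This is a minimal illustration under assumed inputs, not the authors' actual procedure: it aligns the two OCR token streams and, where they disagree, prefers the variant attested in a lexicon, echoing the lexicon-based strategy the abstract mentions. The sample lexicon and token streams are made up for illustration.

```python
import difflib

# Toy lexicon standing in for the enlarged system lexicons described above.
LEXICON = {"die", "ersteigung", "des", "matterhorns", "war", "schwierig"}

def merge_ocr_outputs(tokens_a, tokens_b, lexicon=LEXICON):
    """Merge two OCR token streams, preferring lexicon-attested variants.

    Where the two systems agree, the shared reading is kept; where they
    disagree, the variant with more lexicon-attested tokens wins, with
    ties falling back to the first system's reading.
    """
    merged = []
    matcher = difflib.SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    for op, a0, a1, b0, b1 in matcher.get_opcodes():
        if op == "equal":
            merged.extend(tokens_a[a0:a1])
        else:
            span_a, span_b = tokens_a[a0:a1], tokens_b[b0:b1]
            score_a = sum(t.lower() in lexicon for t in span_a)
            score_b = sum(t.lower() in lexicon for t in span_b)
            merged.extend(span_b if score_b > score_a else span_a)
    return merged

# Example: system A misreads "Ersteigung", system B misreads "schwierig".
out_a = "Die Crsteigung des Matterhorns war schwierig".split()
out_b = "Die Ersteigung des Matterhorns war schwicrig".split()
print(" ".join(merge_ocr_outputs(out_a, out_b)))
# → Die Ersteigung des Matterhorns war schwierig
```

Note that real merging is harder than this sketch at "partially correct word groups", where tokenization differences mean the spans being compared are not one-to-one.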


Similar Resources

A Survey on Various OCR Errors

Research on correcting words in OCR text mainly centres on (1) non-word errors, (2) isolated-word error correction and (3) context-dependent word correction. Various kinds of techniques have been developed. This paper surveys techniques for correcting these errors and determines which perform better. General Terms: Optical Character Recognition, Natural L...
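Isolated-word error correction, one of the categories this survey covers, can be illustrated with a minimal edit-distance sketch: pick the lexicon entry closest to the OCR token, subject to a distance threshold. The lexicon, threshold and sample tokens here are illustrative assumptions, not taken from the surveyed papers.

```python
def edit_distance(s, t):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

def correct_word(token, lexicon, max_dist=2):
    """Replace an unknown token with its nearest lexicon entry, if close enough."""
    if token.lower() in lexicon:
        return token                       # already a known word
    best = min(lexicon, key=lambda w: edit_distance(token.lower(), w))
    return best if edit_distance(token.lower(), best) <= max_dist else token

lexicon = {"gipfel", "gletscher", "berg"}
print(correct_word("Glctschcr", lexicon))  # → gletscher
```

Context-dependent correction, by contrast, must also consult the surrounding words, since the misrecognized form may itself be a valid lexicon entry.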


An expert system for automatically correcting OCR output

This paper describes a new expert system for automatically correcting errors made by optical character recognition (OCR) devices. The system, which we call the post-processing system, is designed to improve the quality of text produced by an OCR device in preparation for subsequent retrieval from an information system. The system is composed of numerous parts: an information retrieval system, a...


A practical implementation of automatic text categorisation and correction for the conversion of noisy OCR documents into braille and large print

A novel text categorisation method called C-measure is applied to the problem of automatically correcting standard blocks of noisy OCR text within structured documents such as credit card statements and standardised letters. The blocks of text in the scanned image are first identified, then classified using the C-measure algorithm against a small set of known correct text. The text block is subse...


Correcting English Text Using PPM Models

An essential component of many applications in natural language processing is a language modeler able to correct errors in the text being processed. For optical character recognition (OCR), poor scanning quality or extraneous pixels in the image may cause one or more characters to be mis-recognized; while for spelling correction, two characters may be transposed, or a character may be inadverte...
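The idea behind this line of work — scoring candidate readings with a character-level language model — can be shown with a much simpler stand-in for PPM. The bigram model, smoothing constants and training text below are illustrative assumptions; PPM itself uses adaptive variable-length contexts with escape probabilities.

```python
import math
from collections import Counter

# Toy character-bigram language model standing in for PPM.
def train_bigram_lm(text):
    pairs = Counter(zip(text, text[1:]))
    unigrams = Counter(text)
    return pairs, unigrams

def log_prob(s, model, alpha=1.0, vocab=128):
    """Sum of add-alpha smoothed log P(next char | previous char)."""
    pairs, unigrams = model
    lp = 0.0
    for a, b in zip(s, s[1:]):
        lp += math.log((pairs[(a, b)] + alpha) / (unigrams[a] + alpha * vocab))
    return lp

model = train_bigram_lm("the mountain weather was clear and the path was steep")

# Rank correction candidates for a misrecognized word by model score.
candidates = ["weather", "wcathcr"]
best = max(candidates, key=lambda c: log_prob(c, model))
print(best)  # → weather
```

A PPM model plays the same role as `log_prob` here, but with far better probability estimates, which is what makes it usable for both OCR correction and spelling correction.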


Diploma Thesis: Unsupervised Post-Correction of OCR Errors

The trend to digitize (historic) paper-based archives has emerged in recent years. The advantages of digital archives are easy access, searchability and machine readability. These advantages can only be ensured if few or no OCR errors are present. These errors result from misrecognized characters during the OCR process. Large archives make it unreasonable to correct errors manually. The...


